Abstract

Subspace clustering and feature extraction are two of the most commonly used unsupervised
learning techniques in computer vision and pattern recognition. State-of-the-art techniques for
subspace clustering make use of recent advances in sparsity and rank minimization. However, existing
techniques are computationally expensive and may result in degenerate solutions that degrade
clustering performance in the case of insufficient data sampling. To partially solve these problems,
and inspired by existing work on matrix factorization, this paper proposes fixed-rank representation
(FRR) as a unified framework for unsupervised visual learning. FRR is able to reveal the structure
of multiple subspaces in closed-form when the data is noiseless. Furthermore, we prove that under
some suitable conditions, even with insufficient observations, FRR can still reveal the true subspace
memberships. To achieve robustness to outliers and noise, a sparse regularizer is introduced into the
FRR framework. Beyond subspace clustering, FRR can be used for unsupervised feature extraction.
As a non-trivial byproduct, a fast numerical solver is developed for FRR. Experimental results on
both synthetic data and real applications validate our theoretical analysis and demonstrate the benefits
of FRR for unsupervised visual learning.

Index Terms

Low-Rank Representation, Matrix Factorization, Motion Segmentation, Feature Extraction.

I. Introduction

Clustering and embedding are two of the most important techniques for visual data analysis. In 
the last decade, inspired by the success of compressive sensing, there has been a growing interest in
incorporating sparsity into visual learning, such as image/video processing [1], object classification [2],
[3] and motion segmentation [4]. Early studies [1], [2] usually consider the 1D sparsity (i.e., the
number of nonzero entries of a vector, also known as the $\ell_0$ norm) in their models. Recently, there
has been a surge of methods [5], [6], [7] which also consider the rank of a matrix as a 2D sparsity measure.
However, it is difficult to directly solve these models due to the discrete nature of the $\ell_0$ norm and
the rank function. A common strategy to alleviate this problem has been to use the $\ell_1$ norm and the
nuclear norm [8] as the convex surrogates of the $\ell_0$ norm and the rank function, respectively.

TECHNICAL REPORT

An important problem in unsupervised learning of visual data is subspace clustering. Recent 
advances in subspace clustering make use of sparsity-based techniques. For example, sparse subspace 
clustering (SSC) [5], [9], [10] uses the 1D sparsest representation vectors produced by $\ell_1$ norm
minimization to define the affinity matrix of an undirected graph. Then subspace clustering is performed
by spectral clustering techniques, such as normalized cut (NCut) [11]. However, as SSC computes the
sparsest representation of each point individually, there is no global structural constraint on the affinity
matrix. This characteristic can degrade the clustering performance when data is grossly corrupted.
Moreover, according to the theoretical work of [12], the within-subspace connectivity assumption for
SSC holds only for 2- and 3-dimensional subspaces. SSC may therefore over-segment subspaces
when the dimensions are higher than 3.

Low-rank representation (LRR) [6], [7], [13] is another recently proposed sparsity-based subspace
clustering model. The intuition behind LRR is to learn a low-rank representation of the data. The
work by [14] shows that LRR is intrinsically equivalent to the shape interaction matrix (SIM) [15] in
the absence of noise. In this case, LRR can reveal the true clustering when the subspaces are independent
and the data sampling is sufficient.¹ However, LRR suffers from some limitations as well. First, the
nuclear norm minimization in LRR typically requires computing the singular value decomposition
(SVD) at each iteration, which becomes computationally impractical as the scale of the problem grows.
By combining a linearized version of the alternating direction method (ADM) [17] with an acceleration
technique for SVD computation, the work in [18] proposed a fast solver, which significantly improves
the speed for solving LRR. However, the SVD computation still cannot be completely avoided.
Second, and more importantly, if the observations are insufficient, LRR (also SSC) may result in a
degenerate solution that significantly degrades the clustering performance. The work in [16] introduces
"hidden effects" to overcome this drawback. However, it is unclear whether such "hidden effects" can
recover the multiple subspace structure for clustering. Moreover, introducing latent variables makes
the problem more complex and hard to optimize.

¹The subspaces are independent if and only if the dimension of their direct sum is equal to the sum of their dimensions [14].
For each subspace, the data sampling is sufficient if and only if the rank of the data matrix is equal to the dimension of
the subspace [16].



The insufficient data sampling problem in SSC and LRR is similar in spirit to the small sample size
problem, which is common in some subspace learning methods, such as linear discriminant analysis [19]
and canonical correlation analysis [20]. In these methods, if the number of samples is smaller than
the dimension of the features, the covariance matrices are rank deficient. There are three common
approaches to solving this problem [21]: dimensionality reduction, regularization and factorization (i.e.,
explicitly parameterizing the projection matrix as the product of low-rank matrices). In this paper, we
incorporate the factorization idea into representation learning and propose fixed-rank representation
(FRR) to partially solve the problems in existing unsupervised visual learning models. FRR has three
main benefits:



• Unlike SSC and LRR, which use the sparsest and lowest rank representations, FRR explicitly 
parameterizes the representation matrix as the product of two low-rank matrices. When there is 
no noise and the data sampling is sufficient, we prove that the FRR solution is also the optimal 
solution to LRR. In this case, FRR can reveal the multiple subspace structure. Furthermore, 
we prove that under some suitable conditions, even when the data sampling is insufficient, the 
memberships of samples to each subspace still can be identified by FRR. A sparse regularizer 
is introduced to FRR to model both small noises and gross outliers, which provides robustness 
to FRR in real applications. 

• The most expensive computational component in LRR is to perform SVD at each iteration. 
Even with some acceleration techniques, the scalability of the nuclear norm minimization is still 
limited by the computational complexity of SVD. In contrast, FRR avoids SVD computation 
and can be efficiently applied to large-scale problems. 

• FRR can also be extended for unsupervised feature extraction. By considering a transposed
version of FRR (TFRR), we show that FRR is related to existing feature extraction methods,
such as principal component analysis (PCA) [22], [23]. Indeed, our analysis provides a unified
framework to understand single subspace feature extraction and multiple subspace clustering by
analyzing the column and row spaces of the data.

February 25, 2013 DRAFT 
II. A Review of Previous Work

Given a data set² $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_k] \in \mathbb{R}^{d \times n}$ drawn from a union of $k$ subspaces $\{\mathcal{C}_i\}_{i=1}^{k}$,
where $\mathbf{X}_i$ is a collection of $n_i$ data points sampled from the subspace $\mathcal{C}_i$ with an unknown dimension
$d_{\mathcal{C}_i}$, the goal of subspace clustering is to cluster data points into their respective subspaces. This
section provides a review of SSC and LRR for solving this problem. To clearly understand the
mechanism of these methods, we first consider the case when the data is noise-free. From now on,
we always write $\mathbf{X} = \mathbf{U}_X \mathbf{\Sigma}_X \mathbf{V}_X^T$ and $r_X$ as the compact SVD and the rank of $\mathbf{X}$, respectively.

A. Sparse Subspace Clustering (SSC) 

SSC [5], [9], [10] is based on the idea that each data point in the subspace $\mathcal{C}_i$ should be represented
as a linear combination of other points that are also in $\mathcal{C}_i$. Using this intuition, SSC finds the sparsest
representation coefficients $\mathbf{Z} = [[\mathbf{Z}]_1, [\mathbf{Z}]_2, \cdots, [\mathbf{Z}]_n]$ by considering the sequence of optimization
problems

$$\min_{[\mathbf{Z}]_i} \|[\mathbf{Z}]_i\|_1, \quad \text{s.t.} \quad [\mathbf{X}]_i = \mathbf{X}[\mathbf{Z}]_i, \ [\mathbf{Z}]_{ii} = 0, \tag{1}$$

where $i = 1, 2, \cdots, n$. Then one can use $\mathbf{Z}$ to define the affinity matrix of an undirected graph as
$(|\mathbf{Z}| + |\mathbf{Z}^T|)$ and perform NCut on this graph, where $|\mathbf{Z}|$ denotes a matrix whose entries are the
absolute values of $\mathbf{Z}$. The SSC model can also be rewritten in matrix form as

$$\min_{\mathbf{Z}} \|\mathbf{Z}\|_1, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{Z}, \ [\mathbf{Z}]_{ii} = 0. \tag{2}$$

Note that both $\ell_1$ norm minimization models (1) and (2) can only be solved numerically.

B. Low-Rank Representation (LRR) 

By extending the sparsity measure from 1D to 2D for the representation, LRR [6], [7], [13] proposes
a low-rank based criterion for subspace clustering. By utilizing the nuclear norm as a surrogate for
the rank function, LRR solves the following nuclear norm minimization problem

$$\min_{\mathbf{Z}} \|\mathbf{Z}\|_*, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{Z}. \tag{3}$$

Unlike SSC, which can only be solved numerically, (3) admits the closed-form solution $\mathbf{V}_X\mathbf{V}_X^T$
(also known as SIM [15]), which has a block-diagonal structure [14]. Although [14] has proved this, in
the following section we will provide a simpler derivation that provides new insights into LRR.

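²Bold capital letters (e.g., $\mathbf{M}$) denote matrices. The range and the null spaces of $\mathbf{M}$ are defined as
$\mathcal{R}(\mathbf{M}) := \{\mathbf{a} \mid \exists \mathbf{b}, \mathbf{a} = \mathbf{M}\mathbf{b}\}$ and $\mathcal{N}(\mathbf{M}) := \{\mathbf{a} \mid \mathbf{M}\mathbf{a} = \mathbf{0}\}$, respectively. $[\mathbf{M}]_{ij}$ and $[\mathbf{M}]_i$ denote the
$(i, j)$-th entry and the $i$-th column of $\mathbf{M}$, respectively. $\mathbf{M}^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $\mathbf{M}$.
The block-diagonal matrix formed by a collection of matrices $\mathbf{M}_1, \mathbf{M}_2, \ldots, \mathbf{M}_k$ is denoted by
$\text{diag}(\mathbf{M}_1, \mathbf{M}_2, \ldots, \mathbf{M}_k)$. $\mathbf{1}_n$ is the all-one column vector of length $n$. $\mathbf{I}_n$ is the $n \times n$ identity matrix.
$\langle \cdot, \cdot \rangle$ denotes the inner product of two matrices. A variety of norms on matrices and vectors will be
used. $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|_*$ is the nuclear norm [8], $\|\cdot\|_{2,1}$ is the $\ell_{2,1}$ norm [24], $\|\cdot\|$ is the spectral norm,
and $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_{\infty}$ are the $\ell_1$, $\ell_2$ and $\ell_{\infty}$ norms, respectively.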
III. Fixed-Rank Representation

In this section, we propose a new model, named fixed-rank representation (FRR), for subspace 
clustering. We start with the following analysis on LRR. 

A. Motivation 

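To better understand the mechanism of LRR and illustrate our motivation, we show that $\mathbf{V}_X\mathbf{V}_X^T \in
\mathcal{R}(\mathbf{X}^T)$ is the optimal solution to LRR in a simple way.³ By the identity $\mathbf{X} = \mathbf{X}\mathbf{X}^{\dagger}\mathbf{X}$ and the
constraint in (3), we have $\mathbf{X} = \mathbf{X}\mathbf{Z} = \mathbf{X}\mathbf{X}^{\dagger}\mathbf{X}\mathbf{Z} = \mathbf{X}\mathbf{X}^{\dagger}\mathbf{X}$. Thus $\mathbf{X}^{\dagger}\mathbf{X} = \mathbf{V}_X\mathbf{V}_X^T$ is a feasible
solution to (3). So the general form of the solution is $\mathbf{Z} = \mathbf{V}_X\mathbf{V}_X^T + \mathbf{Z}_n$, where $\mathbf{Z}_n \in \mathcal{N}(\mathbf{X})$. As
$\mathcal{R}(\mathbf{X}^T) \perp \mathcal{N}(\mathbf{X})$, we have $\mathbf{V}_X^T\mathbf{Z}_n = \mathbf{0}$. This together with the duality definition of the nuclear norm [8]
leads to the following inequality

$$\|\mathbf{Z}\|_* = \max_{\|\mathbf{Y}\| \le 1} \langle \mathbf{Z}, \mathbf{Y} \rangle \ge \langle \mathbf{Z}, \mathbf{V}_X\mathbf{V}_X^T \rangle = r_X = \|\mathbf{V}_X\mathbf{V}_X^T\|_*.$$

This concludes that $\mathbf{V}_X\mathbf{V}_X^T$ is the minimizer to (3).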
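The argument above can be checked numerically. The following is a minimal NumPy sketch, not part of the original report: the matrix sizes, the random seed and all variable names are illustrative assumptions. It builds a rank-deficient data matrix, forms the closed-form candidate from the compact SVD, and compares it with another feasible point that carries an extra null-space component.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumed seed, for reproducibility only
d, n, r = 20, 12, 4                     # assumed sizes: d features, n points, rank r
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))

# Compact SVD: keep only the r_X numerically nonzero singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r_X = int(np.sum(s > 1e-10 * s[0]))
V = Vt[:r_X].T                          # n x r_X

Z_star = V @ V.T                        # candidate closed-form solution V_X V_X^T
pinv_form = np.linalg.pinv(X, rcond=1e-10) @ X   # X^dagger X, should coincide

feasibility_gap = np.abs(X - X @ Z_star).max()
nuclear_norm = np.linalg.svd(Z_star, compute_uv=False).sum()

# Another feasible point: Z_star plus a component whose columns lie in N(X).
null_projector = np.eye(n) - Z_star
Z_other = Z_star + null_projector @ rng.standard_normal((n, n))
other_gap = np.abs(X - X @ Z_other).max()
other_nuclear = np.linalg.svd(Z_other, compute_uv=False).sum()

print(r_X, nuclear_norm, other_nuclear)
```

Under these assumptions the nuclear norm of `Z_star` should equal `r_X` up to rounding, and the perturbed feasible point should not have a smaller nuclear norm.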

The first observation from the previous analysis is that LRR can successfully remove the effects
from $\mathcal{N}(\mathbf{X})$ to obtain a block-diagonal matrix when the data sampling is sufficient. However, it is also
observed that the "lowest rank" representation in LRR is actually the largest rank matrix within the
row space of $\mathbf{X}$, namely the rank of this representation is always equal to the dimension of the row
space. Therefore, the lack of observations for each subspace may significantly degrade the clustering
performance. For example, due to insufficient data sampling, the dimension of the row space may
be equal to the number of samples (i.e., $r_X = n < d$). In this case, the optimal solution to (3) may
reduce to an identity matrix and thus LRR may fail. See Fig. 1 as an example.
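The degenerate case can be reproduced on a toy problem. The sketch below is an illustration under assumed sizes (three 10-dimensional subspaces in 30 dimensions, four points each) and is not taken from the original experiments. With fewer points than the intrinsic dimension of each subspace, the data matrix has full column rank and the closed-form representation carries no cluster information.

```python
import numpy as np

rng = np.random.default_rng(2)            # assumed seed
k, p, d_h, d_i = 3, 4, 30, 10             # assumed: p < d_i, so sampling is insufficient

blocks = []
for _ in range(k):
    basis, _ = np.linalg.qr(rng.standard_normal((d_h, d_i)))   # orthonormal basis
    blocks.append(basis @ rng.uniform(size=(d_i, p)))          # p points in the subspace
X = np.hstack(blocks)                     # d_h x (k * p)

_, s, Vt = np.linalg.svd(X, full_matrices=False)
r_X = int(np.sum(s > 1e-10 * s[0]))
Z = Vt[:r_X].T @ Vt[:r_X]                 # V_X V_X^T

print(r_X, np.abs(Z - np.eye(k * p)).max())
```

Here `r_X` equals the number of samples, so `Z` is numerically the identity matrix and every point is only related to itself.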

An obvious question is whether we can find a lower rank representation in the row space of the 
data set to exactly reveal the subspace memberships for clustering, even when the data sampling is 
insufficient. In the following subsection, we give a positive answer to this question. 

B. The Basic Model 

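The key idea of FRR is to minimize the Frobenius norm of the representation $\mathbf{Z}$ instead of the
nuclear norm as in LRR. FRR simultaneously computes a fixed lower rank representation $\tilde{\mathbf{Z}}$ (hereafter
we write $\text{rank}(\tilde{\mathbf{Z}}) = m$). That is, we jointly optimize $\mathbf{Z}$ and $\tilde{\mathbf{Z}}$ as

$$\min_{\mathbf{Z}, \tilde{\mathbf{Z}}} \|\mathbf{Z} - \tilde{\mathbf{Z}}\|_F^2, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{Z}, \ \text{rank}(\tilde{\mathbf{Z}}) = m. \tag{4}$$

Obviously, $\tilde{\mathbf{Z}}$ can be expressed, non-uniquely, as a matrix product $\tilde{\mathbf{Z}} = \mathbf{L}\mathbf{R}$, where $\mathbf{L} \in \mathbb{R}^{n \times m}$ and
$\mathbf{R} \in \mathbb{R}^{m \times n}$. Replacing $\tilde{\mathbf{Z}}$ by $\mathbf{L}\mathbf{R}$, we arrive at our basic FRR model

$$\min_{\mathbf{Z}, \mathbf{L}, \mathbf{R}} \|\mathbf{Z} - \mathbf{L}\mathbf{R}\|_F^2, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{Z}. \tag{5}$$

In the following sections, we will analyze the problem (5), show properties of the solution to (5),
and extend it for real applications.

³Note that here we only analyze the optimality of $\mathbf{V}_X\mathbf{V}_X^T$ to (3), not its uniqueness.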

C. Analysis on the Basic Model 

At first sight, the factorization of $\tilde{\mathbf{Z}}$ leads to a non-convex optimization problem which may prevent
one from getting a global solution. The difficulty results from the fact that the minimizer is
non-unique. Fortunately, in the following theorem, we prove that one can always obtain a globally optimal
solution to (5) in closed-form.

Theorem 1: Let $[\mathbf{V}_X]_{1:m} = [[\mathbf{V}_X]_1, [\mathbf{V}_X]_2, \cdots, [\mathbf{V}_X]_m]$. Then for any fixed $m \le r_X$, $(\mathbf{Z}^*, \mathbf{L}^*, \mathbf{R}^*) :=
(\mathbf{V}_X\mathbf{V}_X^T, [\mathbf{V}_X]_{1:m}, [\mathbf{V}_X]_{1:m}^T)$ is a globally optimal solution to (5) and the minimum objective function
value is $(r_X - m)$.
The proof of this theorem is based on the following lemma. 

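Lemma 2: (Courant-Fischer Minimax Theorem [25]) For any symmetric matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, we
have that

$$\lambda_i(\mathbf{A}) = \max_{\dim(\mathcal{S}) = i} \ \min_{\mathbf{0} \neq \mathbf{y} \in \mathcal{S}} \mathbf{y}^T\mathbf{A}\mathbf{y} / \mathbf{y}^T\mathbf{y}, \quad \text{for } i = 1, 2, \ldots, n,$$

where $\mathcal{S} \subset \mathbb{R}^n$ is some subspace and $\lambda_i(\mathbf{A})$ is the $i$-th largest eigenvalue of $\mathbf{A}$.

Proof: First, by the well-known Eckart-Young theorem [26], given $\mathbf{Z}$, we have

$$\min_{\mathbf{L}, \mathbf{R}} \|\mathbf{Z} - \mathbf{L}\mathbf{R}\|_F^2 = \sum_{i=m+1}^{n} \sigma_i^2(\mathbf{Z}), \tag{6}$$

where $\sigma_i(\mathbf{Z})$ is the $i$-th largest singular value of $\mathbf{Z}$. Now we prove that

$$\text{if } \mathbf{X} = \mathbf{X}\mathbf{Z} \text{ then } \sigma_{r_X}(\mathbf{Z}) \ge 1. \tag{7}$$

By $\mathbf{X} = \mathbf{X}\mathbf{Z}$, we have that $\text{rank}(\mathbf{Z}) \ge r_X$. Then (6) and (7) imply that the minimum objective
function value is no less than $r_X - m$. Indeed, by the compact SVD of $\mathbf{X}$ and $\mathbf{X} = \mathbf{X}\mathbf{Z}$, we have

$$\mathbf{V}_X^T = \mathbf{V}_X^T\mathbf{Z}. \tag{8}$$

By Lemma 2, $\sigma_i(\mathbf{Z}) = \max_{\dim(\mathcal{S}) = i} \min_{\mathbf{0} \neq \mathbf{y} \in \mathcal{S}} \|\mathbf{Z}^T\mathbf{y}\|_2 / \|\mathbf{y}\|_2$, where $\|\cdot\|_2$ is the $\ell_2$ norm of a vector. So by
choosing $\mathcal{S} = \mathcal{R}(\mathbf{V}_X)$ and utilizing (8),

$$\begin{aligned} \sigma_{r_X}(\mathbf{Z}) &\ge \min_{\mathbf{0} \neq \mathbf{y} \in \mathcal{R}(\mathbf{V}_X)} \|\mathbf{Z}^T\mathbf{y}\|_2 / \|\mathbf{y}\|_2 \\ &= \min_{\mathbf{b} \neq \mathbf{0}} \|\mathbf{Z}^T\mathbf{V}_X\mathbf{b}\|_2 / \|\mathbf{V}_X\mathbf{b}\|_2 \\ &= \min_{\mathbf{b} \neq \mathbf{0}} \|\mathbf{V}_X\mathbf{b}\|_2 / \|\mathbf{V}_X\mathbf{b}\|_2 = 1. \end{aligned} \tag{9}$$

Next, when $\mathbf{Z} = \mathbf{V}_X\mathbf{V}_X^T$, it can be easily checked that the objective function value is $(r_X - m)$. Again,
by the Eckart-Young theorem, $\mathbf{L}\mathbf{R} = [\mathbf{V}_X]_{1:m}[\mathbf{V}_X]_{1:m}^T$. Thus $(\mathbf{V}_X\mathbf{V}_X^T, [\mathbf{V}_X]_{1:m}, [\mathbf{V}_X]_{1:m}^T)$ is a
globally optimal solution to (5), thereby completing the proof of the theorem. ■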
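A small numerical sanity check of Theorem 1 may help here. This is only a sketch under assumed sizes and an assumed seed; `best_rank_m_residual` is a hypothetical helper that evaluates the right-hand side of (6). It evaluates the objective of (5) at the closed-form solution and at another feasible representation.

```python
import numpy as np

rng = np.random.default_rng(1)              # assumed seed
d, n, r_X, m = 15, 10, 5, 2                 # assumed sizes with m <= r_X
X = rng.standard_normal((d, r_X)) @ rng.standard_normal((r_X, n))

_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:r_X].T                              # right singular vectors, n x r_X

Z = V @ V.T                                 # Z* of Theorem 1
L = V[:, :m]                                # L* = [V_X]_{1:m}
R = V[:, :m].T                              # R* = [V_X]_{1:m}^T
objective = np.linalg.norm(Z - L @ R, "fro") ** 2


def best_rank_m_residual(Zc, m):
    """Right-hand side of (6): squared tail singular values of Zc."""
    sv = np.linalg.svd(Zc, compute_uv=False)
    return float(np.sum(sv[m:] ** 2))


# Another feasible Z: add a component whose columns lie in the null space of X.
Z_alt = Z + (np.eye(n) - Z) @ rng.standard_normal((n, n))
alt_objective = best_rank_m_residual(Z_alt, m)
sigma_rX = np.linalg.svd(Z_alt, compute_uv=False)[r_X - 1]

print(objective, alt_objective, sigma_rX)
```

The first value should match `r_X - m`, the second should not fall below it, and `sigma_rX` illustrates the bound (7).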

Based on Theorem 1, we can derive the following corollary to illustrate the structure of the optimal
solution to (5).

Corollary 3: Under the assumption that subspaces are independent and data $\mathbf{X}$ is clean, there exists
a globally optimal solution $(\mathbf{Z}^*, \mathbf{L}^*, \mathbf{R}^*)$ to problem (5) with the following structure:

$$\mathbf{Z}^* = \text{diag}(\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_k), \tag{10}$$

where $\mathbf{Z}_i$ is an $n_i \times n_i$ matrix with $\text{rank}(\mathbf{Z}_i) = d_{\mathcal{C}_i}$ and

$$\mathbf{L}^*\mathbf{R}^* \in \mathcal{R}(\mathbf{Z}^*) = \mathcal{R}(\mathbf{X}^T). \tag{11}$$

The proof of this corollary is based on the following lemma.

Lemma 4: [15] Let $\mathbf{X} = \mathbf{U}_X\mathbf{\Sigma}_X\mathbf{V}_X^T$ be the compact SVD. Under the same assumption in
Corollary 3, $\mathbf{V}_X\mathbf{V}_X^T$ is a block diagonal matrix that has exactly $k$ blocks. Moreover, the $i$-th block on its
diagonal is an $n_i \times n_i$ matrix with rank $d_{\mathcal{C}_i}$.

Proof: By the proof of Theorem 1, we have that $\mathbf{Z}^* = \mathbf{V}_X\mathbf{V}_X^T$ is a global optimal solution to (5)
and any global optimal $\mathbf{L}^*$ and $\mathbf{R}^*$ are in the range space $\mathcal{R}(\mathbf{Z}^*)$. So we have that $\mathbf{L}^*\mathbf{R}^* \in \mathcal{R}(\mathbf{Z}^*) =
\mathcal{R}(\mathbf{X}^T)$. By Lemma 4, we achieve the block diagonal structure (10) for $\mathbf{Z}^*$, which concludes the
proof. ■

However, such $\mathbf{Z}^*$ suffers from the same limitation as LRR. Namely, when the data sampling is
insufficient, $\mathbf{Z}^*$ will probably degenerate and thus the clustering may fail.
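The block-diagonal structure stated in Lemma 4 can be observed directly when the sampling is sufficient. The following sketch uses assumed sizes (three independent 4-dimensional subspaces in 30 dimensions, ten points each) and hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(5)              # assumed seed
k, d_h, d_i, p = 3, 30, 4, 10               # assumed: p > d_i, so sampling is sufficient

blocks = []
for _ in range(k):
    basis, _ = np.linalg.qr(rng.standard_normal((d_h, d_i)))
    blocks.append(basis @ rng.standard_normal((d_i, p)))
X = np.hstack(blocks)                       # points are ordered subspace by subspace

_, s, Vt = np.linalg.svd(X, full_matrices=False)
r_X = int(np.sum(s > 1e-10 * s[0]))
Z = Vt[:r_X].T @ Vt[:r_X]                   # V_X V_X^T, the shape interaction matrix

# Measure the off-diagonal blocks and the rank of every diagonal block.
off_block = Z.copy()
block_ranks = []
for j in range(k):
    sl = slice(j * p, (j + 1) * p)
    block_ranks.append(int(np.linalg.matrix_rank(Z[sl, sl], tol=1e-8)))
    off_block[sl, sl] = 0.0
max_off_block = np.abs(off_block).max()

print(r_X, block_ranks, max_off_block)
```

With generic random subspaces of these sizes the subspaces are independent, so the off-diagonal blocks vanish up to rounding and each diagonal block has rank `d_i`.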



Fortunately, as shown in (11), $\mathbf{L}^*\mathbf{R}^*$ can still be spanned by the row space of $\mathbf{X}$. This inspires us
to consider this lower rank representation for subspace clustering.

Corollary 5: Assuming that the columns of $\mathbf{Z}^*$ are normalized (i.e., $\mathbf{1}_n^T\mathbf{Z}^* = \mathbf{1}_n^T$) and fixing $m = k$,
there exist globally optimal $\mathbf{L}^*$ and $\mathbf{R}^*$ to problem (5) such that

$$\mathbf{L}^*\mathbf{R}^* = \text{diag}\left(\tfrac{1}{n_1}\mathbf{1}_{n_1}\mathbf{1}_{n_1}^T, \tfrac{1}{n_2}\mathbf{1}_{n_2}\mathbf{1}_{n_2}^T, \ldots, \tfrac{1}{n_k}\mathbf{1}_{n_k}\mathbf{1}_{n_k}^T\right). \tag{12}$$

Remark: Corollary 5 does not guarantee that an arbitrary rank-$k$ optimal solution has the
block-diagonal structure (12), due to the non-uniqueness of the minimizer $(\mathbf{L}^*, \mathbf{R}^*)$. However, in our experiments,
we have observed that empirically choosing the first $k$ columns of $\mathbf{V}_X$ works well on the tested data
(e.g., Fig. 1).

Proof: By Corollary 3 and the normalization assumption, $\mathbf{Z}^* = \text{diag}(\mathbf{Z}_1^*, \mathbf{Z}_2^*, \ldots, \mathbf{Z}_k^*)$, where $\mathbf{Z}_i^*$
is an $n_i \times n_i$ matrix for subspace $\mathcal{C}_i$ and $\mathbf{1}_{n_i}$ is an eigenvector of $\mathbf{Z}_i^*$ with eigenvalue 1. Thus there exists a
basis $\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_k]$, each vector of which has the form $\mathbf{h}_i = [\mathbf{0}, \mathbf{1}_{n_i}^T, \mathbf{0}]^T$ (scaled to unit length) and
is an eigenvector of $\mathbf{Z}^*$ with eigenvalue 1. By the Eckart-Young theorem (similar to the proof of Theorem 1), we have that
$\mathbf{L}^* = \mathbf{H}$ and $\mathbf{R}^* = \mathbf{H}^T$ are global optimal solutions to (5), which directly leads to (12). ■
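The construction in this proof can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: the cluster sizes, the seed and the way the normalized block-diagonal `Z_star` is built (each block is a projector whose range contains the all-one vector) are hypothetical choices, not taken from the report.

```python
import numpy as np

rng = np.random.default_rng(3)                # assumed seed
sizes = [3, 2, 4]                             # assumed cluster sizes n_1, n_2, n_3
n, k = sum(sizes), len(sizes)

H = np.zeros((n, k))                          # unit-length indicator vectors h_i
expected = np.zeros((n, n))                   # diag((1/n_i) 1 1^T)
Z_star = np.zeros((n, n))                     # a normalized block-diagonal Z*
start = 0
for j, n_j in enumerate(sizes):
    sl = slice(start, start + n_j)
    H[sl, j] = 1.0 / np.sqrt(n_j)
    expected[sl, sl] = 1.0 / n_j
    one = np.ones((n_j, 1)) / np.sqrt(n_j)
    v = rng.standard_normal((n_j, 1))
    v -= one @ (one.T @ v)                    # make v orthogonal to the all-one vector
    v /= np.linalg.norm(v)
    Z_star[sl, sl] = one @ one.T + v @ v.T    # projector whose range contains 1_{n_j}
    start += n_j

LR = H @ H.T                                  # L* R* with L* = H, R* = H^T
residual = np.linalg.norm(Z_star - LR, "fro") ** 2
sv = np.linalg.svd(Z_star, compute_uv=False)
best_rank_k = float(np.sum(sv[k:] ** 2))      # Eckart-Young lower bound for rank k

print(residual, best_rank_k)
```

In this example `LR` attains the best rank-`k` residual of `Z_star`, and its columns still sum to one.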

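In principle, the normalization of $\mathbf{Z}^*$ could be considered a strong assumption, and it cannot
always be guaranteed in real situations. Therefore, we explicitly enforce each column of $\mathbf{Z}$ to sum
to one

$$\min_{\mathbf{Z}, \mathbf{L}, \mathbf{R}} \|\mathbf{Z} - \mathbf{L}\mathbf{R}\|_F^2, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{Z}, \ \mathbf{1}_n^T\mathbf{Z} = \mathbf{1}_n^T. \tag{13}$$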

D. Sparse Regularization for Corruptions 

In real applications, the data are often corrupted by both small noises and gross outliers. In the
following, we show how to extend problem (13) to deal with corruptions. By modeling corruptions
as a new term $\mathbf{E}$, we consider the following regularized optimization problem

$$\min_{\mathbf{Z}, \mathbf{L}, \mathbf{R}, \mathbf{E}} \|\mathbf{Z} - \mathbf{L}\mathbf{R}\|_F^2 + \mu\|\mathbf{E}\|_s, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{Z} + \mathbf{E}, \ \mathbf{1}_n^T\mathbf{Z} = \mathbf{1}_n^T, \tag{14}$$

where the parameter $\mu > 0$ is used to balance the effects of the two terms and $\|\cdot\|_s$ is a sparse norm
corresponding to our assumption on $\mathbf{E}$. Here we adopt the $\ell_{2,1}$ norm to characterize the corruptions
since it can successfully identify the indices of the outliers and remove small noises [27]. Algorithm 1
summarizes the whole FRR based subspace clustering framework.

Algorithm 1 FRR for Subspace Clustering

Input: Let $\mathbf{X} \in \mathbb{R}^{d \times n}$ be a set of data points sampled from $k$ subspaces.

Step 1: Solve (14) to obtain $(\mathbf{Z}^*, \mathbf{L}^*, \mathbf{R}^*)$.

Step 2: Construct a graph by using $(|\mathbf{Z}^*| + |(\mathbf{Z}^*)^T|)$ or $(|\mathbf{L}^*\mathbf{R}^*| + |(\mathbf{L}^*\mathbf{R}^*)^T|)$ as the affinity matrix.

Step 3: Apply NCut to this graph to obtain the clustering.
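Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the implementation used in the report: `spectral_labels` is a hypothetical helper that performs a basic normalized spectral clustering with a small deterministic k-means as a stand-in for NCut, and the demo affinity is built from an ideal block-diagonal $\mathbf{L}^*\mathbf{R}^*$ of the form (12) with assumed cluster sizes.

```python
import numpy as np


def affinity(M):
    """Step 2: symmetric affinity |M| + |M^T|."""
    A = np.abs(M)
    return A + A.T


def spectral_labels(W, k, n_iter=20):
    """Step 3 stand-in: normalized spectral embedding followed by k-means."""
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)                 # eigenvalues in ascending order
    emb = vecs[:, :k]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    # Deterministic farthest-point initialization of the k centers.
    centers = [emb[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = emb[labels == j].mean(axis=0)
    return labels


# Demo on an ideal block-diagonal L*R* with assumed cluster sizes 3, 2 and 4.
sizes = [3, 2, 4]
LR_demo = np.zeros((sum(sizes), sum(sizes)))
start = 0
for n_j in sizes:
    LR_demo[start:start + n_j, start:start + n_j] = 1.0 / n_j
    start += n_j
labels_demo = spectral_labels(affinity(LR_demo), k=len(sizes))
print(labels_demo)
```

On this ideal input the three groups are separated exactly; on real data the quality depends on how close the learned representation is to block-diagonal.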

IV. Extending FRR for Feature Extraction 

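Besides subspace clustering, the mechanism of FRR can also be applied for feature extraction.
That is, one can recover the column space of the data set by solving the following transposed FRR
(TFRR)

$$\min_{\mathbf{Z}, \mathbf{L}, \mathbf{R}} \|\mathbf{Z} - \mathbf{L}\mathbf{R}\|_F^2, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{Z}\mathbf{X}, \tag{15}$$

where $m \le r_X$, $\mathbf{L} \in \mathbb{R}^{d \times m}$, $\mathbf{R} \in \mathbb{R}^{m \times d}$ and $\mathbf{Z} \in \mathbb{R}^{d \times d}$. For noisy data, by using similar techniques
as in Section III-D, we introduce an explicit corruption term $\mathbf{E}$ into the objective function and the
constraint. Hence we obtain the robust version of TFRR for feature extraction

$$\min_{\mathbf{Z}, \mathbf{L}, \mathbf{R}, \mathbf{E}} \|\mathbf{Z} - \mathbf{L}\mathbf{R}\|_F^2 + \mu\|\mathbf{E}\|_s, \quad \text{s.t.} \quad \mathbf{X} = \mathbf{Z}\mathbf{X} + \mathbf{E}. \tag{16}$$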
A. Relationship to Principal Component Analysis

Principal component analysis (PCA) is one of the most popular dimensionality reduction
techniques [22], [23]. The basic ideas behind PCA date back to Pearson in 1901 [22], and a more general
procedure was described by Hotelling [23] in 1933. There are several energy functions which lead to
the subspace spanned by the principal components [21]. For instance, PCA finds the matrix $\mathbf{P} \in \mathbb{R}^{d \times m}$
that minimizes:

$$\min_{\mathbf{P}} \|\mathbf{X} - \mathbf{P}\mathbf{P}^T\mathbf{X}\|_F^2, \quad \text{s.t.} \quad \mathbf{P}^T\mathbf{P} = \mathbf{I}_m. \tag{17}$$

It can be shown that $\mathbf{P}^* = [\mathbf{U}_X]_{1:m}$ is the optimal solution to (17), where $[\mathbf{U}_X]_{1:m} = [[\mathbf{U}_X]_1, [\mathbf{U}_X]_2, \cdots, [\mathbf{U}_X]_m]$.
The following corollary shows that the mechanism of TFRR can also be applied to formulate PCA.

Corollary 6: For any fixed $m \le r_X$, $(\mathbf{Z}^*, \mathbf{L}^*, \mathbf{R}^*) := (\mathbf{U}_X\mathbf{U}_X^T, [\mathbf{U}_X]_{1:m}, [\mathbf{U}_X]_{1:m}^T)$ is a globally
optimal solution to (15) and the minimum objective function value is $(r_X - m)$.

Proof: The proof of Theorem 1 directly leads to the above corollary.
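The link between (15) and (17) can be checked numerically. The sketch below rests on assumed sizes and an assumed seed; `pca_energy` is a hypothetical helper evaluating the objective of (17). It compares the leading left singular vectors with a random orthonormal basis and evaluates the TFRR objective at the solution named in Corollary 6.

```python
import numpy as np

rng = np.random.default_rng(4)                 # assumed seed
d, n, r_X, m = 8, 30, 5, 3                     # assumed sizes with m <= r_X
X = rng.standard_normal((d, r_X)) @ rng.standard_normal((r_X, n))

U, s, _ = np.linalg.svd(X, full_matrices=False)
U_X = U[:, :r_X]                               # left singular vectors of the compact SVD


def pca_energy(P):
    """Objective of (17) for a column-orthonormal P."""
    return np.linalg.norm(X - P @ (P.T @ X), "fro") ** 2


P_star = U_X[:, :m]                            # [U_X]_{1:m}
P_rand, _ = np.linalg.qr(rng.standard_normal((d, m)))
energy_star = pca_energy(P_star)
energy_rand = pca_energy(P_rand)
tail_energy = float(np.sum(s[m:] ** 2))        # energy left in the discarded directions

Z = U_X @ U_X.T                                # Z* of Corollary 6
tfrr_gap = np.abs(X - Z @ X).max()             # feasibility of X = Z X
tfrr_objective = np.linalg.norm(Z - P_star @ P_star.T, "fro") ** 2

print(energy_star, energy_rand, tfrr_objective)
```

The PCA energy of `P_star` coincides with the discarded singular-value energy, and the TFRR objective equals `r_X - m` under these assumptions.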



V. Optimization for FRR 



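In this section, we develop a fast numerical solver for FRR related models by extending the classic
alternating direction method (ADM) [17] to non-convex problems. To solve the problem (14),⁴ we
introduce Lagrange multipliers $\mathbf{\Lambda}$ and $\mathbf{\Pi}$ to remove the equality constraints. The resulting augmented
Lagrangian function is

$$\begin{aligned} \mathcal{L}_A(\mathbf{Z}, \mathbf{L}, \mathbf{R}, \mathbf{E}, \mathbf{\Lambda}, \mathbf{\Pi}) = {} & \|\mathbf{Z} - \mathbf{L}\mathbf{R}\|_F^2 + \mu\|\mathbf{E}\|_{2,1} \\ & + \langle \mathbf{\Lambda}, \mathbf{X} - \mathbf{X}\mathbf{Z} - \mathbf{E} \rangle + \langle \mathbf{\Pi}, \mathbf{1}_n^T\mathbf{Z} - \mathbf{1}_n^T \rangle \\ & + \frac{\beta}{2}\left(\|\mathbf{X} - \mathbf{X}\mathbf{Z} - \mathbf{E}\|_F^2 + \|\mathbf{1}_n^T\mathbf{Z} - \mathbf{1}_n^T\|_2^2\right), \end{aligned} \tag{18}$$

where $\beta > 0$ is a penalty parameter. It is important to note that although (18) is not jointly convex for
all variables, it is convex with respect to each variable while fixing the others. This property allows
the iteration scheme to be well defined. So we minimize (18) with respect to $\mathbf{L}$, $\mathbf{R}$, $\mathbf{Z}$, and $\mathbf{E}$ one at
a time while fixing the others at their latest values, and then update the Lagrange multipliers $\mathbf{\Lambda}$ and
$\mathbf{\Pi}$:

$$\mathbf{L}_+ = \mathbf{Z}\mathbf{R}^{\dagger} = \mathbf{Z}\mathbf{R}^T(\mathbf{R}\mathbf{R}^T)^{\dagger}, \tag{19}$$
$$\mathbf{R}_+ = \mathbf{L}_+^{\dagger}\mathbf{Z} = (\mathbf{L}_+^T\mathbf{L}_+)^{\dagger}\mathbf{L}_+^T\mathbf{Z}, \tag{20}$$
$$\mathbf{Z}_+ = (2\mathbf{I}_n + \beta(\mathbf{X}^T\mathbf{X} + \mathbf{1}_n\mathbf{1}_n^T))^{-1}\mathbf{B}, \tag{21}$$
$$\mathbf{E}_+ = \arg\min_{\mathbf{E}} \ \mu\|\mathbf{E}\|_{2,1} + \frac{\beta}{2}\|\mathbf{C} - \mathbf{E}\|_F^2, \tag{22}$$
$$\mathbf{\Lambda}_+ = \mathbf{\Lambda} + \beta(\mathbf{X} - \mathbf{X}\mathbf{Z}_+ - \mathbf{E}_+), \tag{23}$$
$$\mathbf{\Pi}_+ = \mathbf{\Pi} + \beta(\mathbf{1}_n^T\mathbf{Z}_+ - \mathbf{1}_n^T), \tag{24}$$
$$\beta_+ = \min(\bar{\beta}, \rho\beta), \tag{25}$$

where the subscript $+$ denotes that the values are updated, $\bar{\beta}$ is the upper bound of $\beta$, $\rho > 1$ is the step
length parameter, $\mathbf{B} = 2\mathbf{L}_+\mathbf{R}_+ + \beta(\mathbf{X}^T\mathbf{X} - \mathbf{X}^T(\mathbf{E} - \mathbf{\Lambda}/\beta)) + \beta\mathbf{1}_n\mathbf{1}_n^T - \mathbf{1}_n\mathbf{\Pi}$ and $\mathbf{C} = \mathbf{X} - \mathbf{X}\mathbf{Z}_+ + \mathbf{\Lambda}/\beta$.

⁴As other FRR related models can be solved in a similar way, we do not further explore them in this section.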
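The $\mathbf{E}$-subproblem (22) separates over the columns of $\mathbf{E}$ and is solved by a column-wise shrinkage with threshold $\mu/\beta$. The sketch below shows that standard operator; the function names `l21_norm` and `l21_shrink` and the small example matrix are illustrative assumptions.

```python
import numpy as np


def l21_norm(E):
    """Sum of the Euclidean norms of the columns of E."""
    return float(np.linalg.norm(E, axis=0).sum())


def l21_shrink(C, tau):
    """Minimizer of tau * ||E||_{2,1} + 0.5 * ||C - E||_F^2 (column-wise shrinkage)."""
    norms = np.linalg.norm(C, axis=0)
    # Columns with norm <= tau are set to zero; the rest are shortened by tau.
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return C * scale


C_demo = np.array([[3.0, 0.0, 0.3],
                   [4.0, 0.0, 0.4]])          # column norms: 5, 0, 0.5
E_demo = l21_shrink(C_demo, tau=1.0)
print(E_demo)
```

For (22) the threshold would be `tau = mu / beta`, since dividing the objective by $\beta$ does not change the minimizer.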



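The subproblem (22) can be solved by Lemma 3.2 in [6]. We then reduce the computational cost for
solving (19) and (20). It follows from (20) that

$$\mathbf{L}_+\mathbf{R}_+ = \mathbf{L}_+(\mathbf{L}_+^T\mathbf{L}_+)^{\dagger}\mathbf{L}_+^T\mathbf{Z} = \mathcal{P}_{\mathbf{L}_+}(\mathbf{Z}). \tag{26}$$

By considering the compact SVD $\mathbf{R} = \mathbf{U}_R\mathbf{\Sigma}_R\mathbf{V}_R^T$, we have $\mathbf{L}_+ = \mathbf{Z}\mathbf{V}_R\mathbf{\Sigma}_R^{-1}\mathbf{U}_R^T$ and $\mathbf{Z}\mathbf{R}^T =
\mathbf{Z}\mathbf{V}_R\mathbf{\Sigma}_R\mathbf{U}_R^T$. This implies that $\mathcal{R}(\mathbf{L}_+) = \mathcal{R}(\mathbf{Z}\mathbf{R}^T) = \mathcal{R}(\mathbf{Z}\mathbf{V}_R)$ and

$$\mathbf{L}_+\mathbf{R}_+ = \mathcal{P}_{\mathbf{Z}\mathbf{R}^T}(\mathbf{Z}), \tag{27}$$

where $\mathcal{P}_{\mathbf{Z}\mathbf{R}^T}$ is the orthogonal projection onto $\mathcal{R}(\mathbf{Z}\mathbf{R}^T)$. Since the objective function of (14) depends
on the product $\mathbf{L}_+\mathbf{R}_+$, different values of $\mathbf{L}_+$ and $\mathbf{R}_+$ are essentially equivalent as long as they give
the same product. The identity (27) shows that the inversions $(\mathbf{R}\mathbf{R}^T)^{\dagger}$ and $(\mathbf{L}_+^T\mathbf{L}_+)^{\dagger}$ can be saved
when the projection $\mathcal{P}_{\mathbf{Z}\mathbf{R}^T}$ is computed. Specifically, one can compute $\mathcal{P}_{\mathbf{Z}\mathbf{R}^T} = \mathbf{Q}\mathbf{Q}^T$, where $\mathbf{Q}$ is
the orthonormal factor of the QR factorization of $\mathbf{Z}\mathbf{R}^T$. Then we have $\mathbf{L}_+\mathbf{R}_+ = \mathbf{Q}\mathbf{Q}^T\mathbf{Z}$ and one can derive:

$$\mathbf{L}_+ = \mathbf{Q}, \tag{28}$$
$$\mathbf{R}_+ = \mathbf{Q}^T\mathbf{Z}. \tag{29}$$

The schemes (28) and (29) are often preferred since computing (29) by QR factorization is generally
more stable than solving the normal equations [28]. The complete algorithm is summarized in
Algorithm 2.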
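The equivalence between the pseudoinverse updates (19), (20) and the QR-based updates (28), (29) can be verified on random inputs. This is a small sketch with assumed sizes and an assumed seed; it presumes that $\mathbf{Z}\mathbf{R}^T$ has full column rank, which holds for generic random matrices.

```python
import numpy as np

rng = np.random.default_rng(6)                # assumed seed
n, m = 9, 3                                   # assumed sizes
Z = rng.standard_normal((n, n))
R = rng.standard_normal((m, n))

# Updates (19) and (20), written with explicit pseudoinverses.
L_plus = Z @ np.linalg.pinv(R)
R_plus = np.linalg.pinv(L_plus) @ Z

# Updates (28) and (29): reduced QR of Z R^T gives an n x m orthonormal Q.
Q, _ = np.linalg.qr(Z @ R.T)
L_qr, R_qr = Q, Q.T @ Z

gap = np.abs(L_plus @ R_plus - L_qr @ R_qr).max()
print(gap)
```

The individual factors differ, but the product, which is all the objective depends on, agrees up to rounding.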

VI. Experimental Results 

This section compares the performance of FRR against state-of-the-art algorithms on both subspace
clustering and feature extraction. All experiments are performed on a notebook computer with an Intel
Core i7 CPU at 2.00 GHz and 6GB of memory, running Windows 7 and Matlab version 7.10.
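Algorithm 2 Solving (14) by ADM-type Algorithm

Input: Observation matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$, $m > 0$, $\varepsilon_1, \varepsilon_2 > 0$, parameters $\beta > 0$ and $\rho > 1$.
Initialization: Initialize $\mathbf{Z}_0 \in \mathbb{R}^{n \times n}$, $\mathbf{L}_0 \in \mathbb{R}^{n \times m}$, $\mathbf{R}_0 \in \mathbb{R}^{m \times n}$, $\mathbf{E}_0 \in \mathbb{R}^{d \times n}$, $\mathbf{\Lambda}_0 \in \mathbb{R}^{d \times n}$ and
$\mathbf{\Pi}_0 \in \mathbb{R}^{1 \times n}$.
while not converged do

Step 1: Update $(\mathbf{Z}, \mathbf{L}, \mathbf{R}, \mathbf{E}, \mathbf{\Lambda}, \mathbf{\Pi})$ by (28), (29) and (21)-(25).

Step 2: Check the convergence conditions:

$\|\mathbf{X} - \mathbf{X}\mathbf{Z}_+ - \mathbf{E}_+\|_{\infty} < \varepsilon_1$ and $\|\mathbf{1}_n^T\mathbf{Z}_+ - \mathbf{1}_n^T\|_{\infty} < \varepsilon_2$.

end while

Output: $\mathbf{Z}^*$, $\mathbf{L}^*$, $\mathbf{R}^*$ and $\mathbf{E}^*$.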
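Algorithm 2 can be sketched in NumPy as below. This is an illustrative reading of updates (28), (29) and (21)-(25), not the authors' Matlab code. The function name `frr_adm`, the default parameter values, the initialization (identity for $\mathbf{Z}$, Gaussian for $\mathbf{R}$, zeros elsewhere) and the iteration cap are all assumptions, and the infinity norm is taken as the largest absolute entry.

```python
import numpy as np


def l21_shrink(C, tau):
    """Column-wise shrinkage solving the E-subproblem (22) with tau = mu / beta."""
    norms = np.linalg.norm(C, axis=0)
    return C * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)


def frr_adm(X, m, mu=1.0, beta=1.0, rho=1.2, beta_max=1e10,
            eps1=1e-6, eps2=1e-6, max_iter=500, seed=0):
    d, n = X.shape
    rng = np.random.default_rng(seed)
    ones = np.ones((n, 1))
    Z = np.eye(n)                                  # assumed initialization
    R = rng.standard_normal((m, n))                # assumed initialization
    E = np.zeros((d, n))
    Lam = np.zeros((d, n))                         # multiplier for X = XZ + E
    Pi = np.zeros((1, n))                          # multiplier for 1^T Z = 1^T
    XtX = X.T @ X
    G = XtX + ones @ ones.T
    for _ in range(max_iter):
        Q, _ = np.linalg.qr(Z @ R.T)               # reduced QR, Q is n x m
        L = Q                                      # (28)
        R = Q.T @ Z                                # (29)
        B = (2.0 * L @ R + beta * (XtX - X.T @ (E - Lam / beta))
             + beta * (ones @ ones.T) - ones @ Pi)
        Z = np.linalg.solve(2.0 * np.eye(n) + beta * G, B)    # (21)
        E = l21_shrink(X - X @ Z + Lam / beta, mu / beta)     # (22)
        res1 = X - X @ Z - E
        res2 = ones.T @ Z - ones.T
        Lam = Lam + beta * res1                    # (23)
        Pi = Pi + beta * res2                      # (24)
        beta = min(beta_max, rho * beta)           # (25)
        if np.abs(res1).max() < eps1 and np.abs(res2).max() < eps2:
            break
    return Z, L, R, E


# Demo on clean data: two 2-dimensional subspaces in R^10, eight points each.
rng = np.random.default_rng(0)
X_demo = np.hstack([rng.standard_normal((10, 2)) @ rng.standard_normal((2, 8))
                    for _ in range(2)])
Z_hat, L_hat, R_hat, E_hat = frr_adm(X_demo, m=2)
affinity_lr = np.abs(L_hat @ R_hat) + np.abs(L_hat @ R_hat).T
print(np.abs(X_demo - X_demo @ Z_hat - E_hat).max())
```

The returned `Z_hat` or the product `L_hat @ R_hat` can then be passed to the affinity construction of Algorithm 1. Since the problem is non-convex, the result may depend on the initialization.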



A. Subspace Clustering 

We first consider the subspace clustering problem, and compare the clustering performance and
computational speed of FRR to existing state-of-the-art methods, such as SIM, Random Sample
Consensus (RANSAC) [29], Local Subspace Analysis (LSA) [30], SSC and LRR. As shown in
Section III, both $\mathbf{Z}$ and $\mathbf{L}\mathbf{R}$ can be utilized for clustering; we call these two strategies FRR$_1$ and
FRR$_2$, respectively.

1) Synthetic Data: We performed subspace clustering on synthetic data to illustrate the insufficient
data sampling problem (to verify the analysis in Section III). Let $k$, $p$, $d_h$ and $d_i$ denote the number
of subspaces, the number of points in each subspace, the features (i.e., observed dimension) and the
intrinsic dimension of the subspace, respectively. Then the data set, parameterized as $(k, p, d_h, d_i)$,
is generated by the same procedure as in [6]: $k$ independent subspaces $\{\mathcal{C}_i\}_{i=1}^{k}$ are constructed, whose
bases $\{\mathbf{U}_i\}_{i=1}^{k}$ are computed by $\mathbf{U}_{i+1} = \mathbf{T}\mathbf{U}_i$, $1 \le i \le k - 1$, where $\mathbf{T}$ is a random rotation and $\mathbf{U}_1$ is
a random column orthogonal matrix of dimension $d_h \times d_i$. Then we construct a $d_h \times kp$ data matrix
$\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k]$ by sampling $p$ data vectors from each subspace by $\mathbf{X}_i = \mathbf{U}_i\mathbf{C}_i$, $1 \le i \le k$,
with $\mathbf{C}_i$ being a $d_i \times p$ matrix with uniform distribution. To generate the point set for insufficient
data sampling clustering, we fix $k = 10$, $d_h = 100$ and $d_i = 50$ and vary $p \in [10, 30]$. In this way,
the number of samples in each subspace (at most 30) is less than the intrinsic dimension (50 for each
subspace).
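The data generation procedure can be sketched as follows. The function name `make_subspace_data` and the seed are assumptions, and the "random rotation" is drawn here as the orthogonal factor of a Gaussian matrix, which is one possible reading (it may include a reflection) rather than the exact procedure of [6].

```python
import numpy as np


def make_subspace_data(k, p, d_h, d_i, seed=0):
    """Draw p points from each of k subspaces with bases U_{i+1} = T U_i."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d_h, d_i)))   # column-orthogonal U_1
    T, _ = np.linalg.qr(rng.standard_normal((d_h, d_h)))   # assumed form of T
    blocks, labels = [], []
    for i in range(k):
        C = rng.uniform(size=(d_i, p))                     # uniform coefficients C_i
        blocks.append(U @ C)                               # X_i = U_i C_i
        labels.extend([i] * p)
        U = T @ U                                          # next basis
    return np.hstack(blocks), np.array(labels)


# Setting used in the text with the smallest number of points per subspace.
X_syn, y_syn = make_subspace_data(k=10, p=10, d_h=100, d_i=50)
print(X_syn.shape, np.bincount(y_syn))
```

With `p = 10` each block has only ten points in a 50-dimensional subspace, which is the insufficient sampling regime discussed above.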

Fig. 1 illustrates the structures of $\mathbf{Z} = \mathbf{V}_X\mathbf{V}_X^T$ and $\mathbf{L}\mathbf{R} = [\mathbf{V}_X]_{1:k}[\mathbf{V}_X]_{1:k}^T$ when $p = 10$. Since the
data sampling is insufficient, the optimal $\mathbf{Z}$ for (3) and (5) reduces to $\mathbf{I}_n$ (see Fig. 1 (a)). In contrast,
$\mathbf{L}\mathbf{R}$ can successfully reveal the multiple subspace structure (see Fig. 1 (b)).

We also compared the clustering performances of $\mathbf{Z}$ and $\mathbf{L}\mathbf{R}$ on the generated data. Fig. 2 shows
the clustering accuracy as a function of the number of points. It can be seen that the clustering
accuracy of $\mathbf{Z}$ is very sensitive to the particular sampling. Although it performs better as $p$
increases, the highest clustering accuracy is only around 80% ($p = 30$). In contrast, $\mathbf{L}\mathbf{R}$ achieves
almost perfect results on all data sets. This confirms that the affinity matrix calculated from $\mathbf{L}\mathbf{R}$ can
successfully overcome the drawback of using $\mathbf{Z}$ in (5) and LRR (also SIM) when the data sampling
is insufficient.

Fig. 1. The structures of $\mathbf{Z}$ and $\mathbf{L}\mathbf{R}$, where $\text{rank}(\mathbf{Z}) = \text{rank}(\mathbf{X}) = kp = 100$ and $\text{rank}(\mathbf{L}\mathbf{R}) = k = 10$, respectively.
Panels: (a) $\mathbf{Z}$, (b) $\mathbf{L}\mathbf{R}$.

Fig. 2. The mean and std. clustering accuracies (%) of $\mathbf{Z}$ and $\mathbf{L}\mathbf{R}$ over 20 runs. The x-axis represents the number of
samples in each subspace and the y-axis represents the clustering accuracy.

2) Motion Segmentation: Motion segmentation refers to the problem of segmenting tracked feature
point trajectories of multiple moving objects in a video sequence. As shown in [4], all the tracked
points from a single rigid motion lie in a four-dimensional linear subspace. So this task can be regarded
as a subspace clustering problem. We perform the experiments on the Hopkins 155 database [31],
which is an extensive benchmark for motion segmentation. This database consists of 156 sequences
of two or three motions; thus there are 156 clustering tasks in total. For a fair comparison, we apply
all algorithms to the raw data and the parameters of these methods have been tuned to the best.

We report the segmentation errors in Table I and present the percentage of sequences for
which the segmentation error is less than or equal to a given percentage of misclassification in Fig. 3.

Fig. 3. Percentage of sequences for which the segmentation error is less than or equal to a given percentage of
misclassification.

TABLE I

Segmentation errors (%) on Hopkins 155 raw data.

           |        2 Motions         |        3 Motions         |        All (156)
Method     | mean  median  std.  max. | mean  median  std.  max. | mean  median  std.  max.
SIM        | 24.1   24.8   15.4  49.2 | 27.9   28.5   15.8  64.1 | 25.1   25.3   15.7  64.1
RANSAC     |  9.6    3.3   13.1  49.3 | 13.8    7.8   13.7  44.7 | 10.8    4.2   13.5  49.3
LSA        |  6.8    2.8    8.0  40.9 | 16.8   15.6   12.6  46.6 |  9.1    4.8   10.1  46.6
SSC        |  3.7    0.0    9.7  49.9 | 11.4    3.3   15.0  44.6 |  5.5    0.0   11.6  49.9
LRR        |  3.2    0.3    8.2  40.3 |  7.8    2.8   10.3  41.5 |  4.3    0.6    8.9  41.5
FRR$_1$    |  2.5    0.0    7.4  40.8 |  5.9    1.4   10.9  39.4 |  3.5    0.0    8.9  41.8
FRR$_2$    |  1.8    0.0    5.3  36.1 |  4.7    1.0    9.1  41.5 |  2.6    0.0    6.5  41.5

It can be noticed that the performances of the three sparsity-based models (i.e., SSC, LRR and FRR)
are better than those of the other methods. SSC is worse than LRR because the 1D $\ell_1$ norm based criterion finds
the representation coefficients of each vector individually, and there is no global constraint. Although
the basic forms of LRR (3) and FRR (5) share the same optimal solution for $\mathbf{Z}$, FRR$_1$ performs even
better than LRR on real data. This is because enforcing the normalization constraint in (14) can
improve the performance for clustering. Overall, FRR$_2$ outperforms all other methods in this paper.
This result, again, confirms that $\mathbf{L}\mathbf{R}$ in FRR$_2$ is better than the general $\mathbf{Z}$ in LRR and FRR$_1$ for
subspace clustering.

For the three sparsity-based methods, Table II reports the time in seconds. We can see that the
computational time of SSC is lower than that of the standard LRR. This is because the $\ell_1$ norm minimizations
in SSC can be solved in parallel and only a thresholding process is needed at each iteration,
while LRR is solved with an SVD in each iteration and does not scale well with a large number of
samples. By combining linearized ADM with an acceleration technique for SVD, the work in [18]
proposed a fast solver for LRR. The running time of this approach is even less than that of SSC. Our FRR,
again, achieves the highest efficiency because it completely avoids SVD computation in the iterations.

Fig. 4. Examples of the FRGC-Caltech data set. The top two rows correspond to face images and the bottom row to non-face
images.



TABLE II

The average running time (seconds) per sequence for three sparsity-based methods. LRR(A) denotes
the accelerated LRR proposed in [18].

Method   | 2 Motions | 3 Motions | All (156)
SSC      |    3.5445 |    7.8493 |    4.5057
LRR      |   38.5156 |  115.3140 |   55.6259
LRR(A)   |    1.9415 |    3.6788 |    2.3319
FRR      |    0.9990 |    2.2799 |    1.2847



B. Feature Extraction and Outlier Detection 

This experiment tests the effectiveness of TFRR for feature extraction in the presence of occlusions.
To simulate sample-outliers, we created a dataset by combining images with faces from the FRGC
version 2 [32] and images not containing faces from Caltech-256 [33]. We selected 20 images for the
first 180 subjects of the FRGC database, having a total of 3600 images. For the Caltech-256 database,
which contains 257 image categories, we randomly selected 1 image from each class (a total of
257 non-facial images). All images are resized to $32 \times 36$ and the pixel values are normalized to
[0, 1]. As shown in Fig. 4, there are two types of corruptions: small errors in the facial images (e.g.,
illuminations and occlusions) and non-facial outliers.

The goal of this task is to robustly extract facial features and use them for classification. That is, we
learn a mapping P between high-dimensional observations and low-dimensional features using TFRR,
and identify outliers in the training set by E. Then, for a new test sample x, the feature vector y can
be computed as y = Px. We selected the first k (k = 40, 80) identities and the 257 non-facial images
as the training set and used the facial images of the remaining (180 - k) identities for testing. We compared
two TFRR-based strategies (one directly uses P = Z, called TFRR1, and the other computes
the orthogonal basis P = orth(LR), called TFRR2) with the "Raw data" baseline and other
state-of-the-art approaches, such as PCA, Locality Preserving Projection (LPP) [34] and Neighborhood
Preserving Embedding (NPE) [35]. The parameters and the feature dimensions of all methods are
tuned to the best for each training set. Table III demonstrates that the performances of TFRR1 and
TFRR2 are both significantly better than the baseline and PCA. Moreover, TFRR2 outperforms all
other methods in these experiments.
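The two mappings described above can be sketched in a few lines of Python. This is not the authors' implementation: the names `orth_basis` and `extract_features` are hypothetical, an SVD stands in for orth(·), random matrices stand in for the learned factors L and R, and the orthonormal basis is applied through its transpose so that the output has the reduced dimension (an assumption about how P = orth(LR) is used in y = Px).

```python
import numpy as np

def orth_basis(M, tol=1e-10):
    """Orthonormal basis of the column space of M, computed via an SVD."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    if s.size == 0 or s.max() == 0:
        return U[:, :0]
    rank = int(np.sum(s > tol * s.max()))
    return U[:, :rank]

def extract_features(L, R, X_new, mode="TFRR2"):
    """Map the columns of X_new to feature vectors y = P x.

    mode="TFRR1": P = Z = L R, features keep the input dimension.
    mode="TFRR2": P is taken as the transpose of an orthonormal basis of
                  the range of L R (assumption), giving rank(L R) features.
    """
    Z = L @ R
    if mode == "TFRR1":
        return Z @ X_new
    B = orth_basis(Z)
    return B.T @ X_new

# Toy stand-ins for the learned factors (hypothetical sizes).
rng = np.random.default_rng(0)
d, r, n = 12, 3, 5
L = rng.standard_normal((d, r))
R = rng.standard_normal((r, d))
X_new = rng.standard_normal((d, n))

B = orth_basis(L @ R)
Y1 = extract_features(L, R, X_new, mode="TFRR1")  # d x n
Y2 = extract_features(L, R, X_new, mode="TFRR2")  # r x n
```

In this toy setting the first strategy returns features of the same dimension as the input, while the second returns one coordinate per basis vector, which matches the dimensions quoted for TFRR1 and TFRR2 in Table III.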

As shown in Fig. 5, the main advantage of TFRR-based methods comes from their ability to
extract intrinsic facial features and remove outliers. One can see that most of the intrinsic facial
features are projected into the range space (modeled by ZX, see the middle row), while the small
errors in the facial images (e.g., illuminations and occlusions) and the non-facial outliers (modeled by
E) are automatically removed (see the bottom row).

TABLE III

Classification accuracies (mean ± std.%) on the FRGC-Caltech data set. "Gm/Pn" means that, in the testing
data, m images of each subject are randomly selected as the gallery set and the remaining n images as the
probe set. Such a trial is repeated 20 times. The feature dimensions are: PCA (410D, 358D), LPP (170D,
200D), NPE (320D, 160D) and TFRR2 (190D, 100D). The dimension of the feature vector produced by
TFRR1 is the same as that of the observed data.

Train           Test      Raw          PCA          LPP          NPE          TFRR1        TFRR2
40 × 20 + 257   G5/P15    71.1 ± 3.2   70.0 ± 3.2   85.2 ± 2.4   81.1 ± 2.7   81.5 ± 2.0   88.8 ± 2.7
40 × 20 + 257   G10/P10   82.8 ± 4.6   81.6 ± 4.6   92.2 ± 2.8   89.6 ± 3.6   89.9 ± 2.7   94.1 ± 2.1
80 × 20 + 257   G5/P15    72.3 ± 4.1   71.4 ± 4.1   85.4 ± 2.9   83.7 ± 4.2   82.9 ± 3.3   90.8 ± 2.1
80 × 20 + 257   G10/P10   82.6 ± 3.2   81.6 ± 3.2   91.4 ± 3.2   90.4 ± 3.2   90.1 ± 2.1   94.9 ± 2.9
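The Gm/Pn protocol behind Table III can be sketched as follows. Several parts are assumptions made only for illustration: the classifier is not named in this section, so a nearest-neighbor rule in feature space is used as a stand-in; the function names are hypothetical; and the data are synthetic, well-separated clusters rather than FRGC features.

```python
import numpy as np

def gallery_probe_split(labels, m, rng):
    """Pick m random samples per subject as gallery, the rest as probe."""
    gallery, probe = [], []
    for subject in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == subject))
        gallery.extend(idx[:m])
        probe.extend(idx[m:])
    return np.array(gallery), np.array(probe)

def nearest_neighbor_accuracy(Y, labels, gallery, probe):
    """Accuracy (%) of a 1-NN rule (assumed classifier) on feature columns."""
    G, P = Y[:, gallery], Y[:, probe]
    # Squared Euclidean distance between every probe and every gallery sample.
    d2 = (P ** 2).sum(0)[:, None] - 2.0 * P.T @ G + (G ** 2).sum(0)[None, :]
    pred = labels[gallery][d2.argmin(axis=1)]
    return 100.0 * np.mean(pred == labels[probe])

def run_trials(Y, labels, m, trials=20, seed=0):
    """Repeat the random split and report mean and std of the accuracy."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(trials):
        gallery, probe = gallery_probe_split(labels, m, rng)
        accs.append(nearest_neighbor_accuracy(Y, labels, gallery, probe))
    return float(np.mean(accs)), float(np.std(accs))

# Synthetic stand-in: 3 subjects, 20 samples each, well-separated features.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
Y = 10.0 * np.eye(3)[:, labels] + 0.1 * rng.standard_normal((3, 60))
mean_acc, std_acc = run_trials(Y, labels, m=5)  # G5/P15 with 20 trials
```

With m = 5 and 20 samples per subject this reproduces the G5/P15 split, and setting m = 10 gives G10/P10.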



Fig. 6 plots the energies (in terms of the $\ell_2$ norm) of the columns of E. One can see that the
values of the non-facial samples (the last 257 columns in E) are clearly larger than those of the facial samples.
Therefore, the error term E can also be used to detect the non-facial outliers. Namely, the i-th sample
in X is considered an outlier if and only if $\|[E]_i\|_2 > \gamma$. By setting the parameter $\gamma = 2.2$, the outlier
detection accuracies¹ are 98.68% on the 40 × 20 + 257 data and 99.19% on the 80 × 20 + 257 data,
respectively.

¹ These accuracies are obtained by computing the percentage of correctly identified outliers. One may also consider the
receiver operating characteristic (ROC) and compute its area under the curve (AUC) [14] to evaluate the performance.
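The column-norm rule for outlier detection can be sketched as follows. The function names and the toy matrix are hypothetical, and the accuracy is computed as the percentage of true outliers that are flagged, which is one reading of the footnote rather than a confirmed definition.

```python
import numpy as np

def detect_outliers(E, gamma):
    """Flag column i as an outlier when its l2 norm exceeds gamma."""
    col_norms = np.linalg.norm(E, axis=0)
    return col_norms > gamma

def outlier_detection_accuracy(flags, is_outlier):
    """Percentage of true outliers that were flagged (assumed definition)."""
    is_outlier = np.asarray(is_outlier, dtype=bool)
    return 100.0 * np.mean(flags[is_outlier])

# Toy error matrix: two small-error columns followed by two large-error columns.
E = np.array([
    [0.1, 0.0, 3.0, 0.0],
    [0.0, 0.2, 0.0, 2.0],
    [0.0, 0.0, 0.0, 2.0],
])
flags = detect_outliers(E, gamma=2.2)
accuracy = outlier_detection_accuracy(flags, [False, False, True, True])
```

Here the threshold 2.2 is the value quoted in the text; on other data it would need to be chosen from the distribution of column norms.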






Fig. 5. Some examples of using TFRR to recover the intrinsic facial features and remove small errors and outliers (modeled
by X = ZX + E). The left two columns correspond to facial samples and the right two to non-facial samples. The top row
shows the input data (X), the middle row shows the features extracted by our algorithm (ZX), and the bottom row shows
the corruptions (E).




Fig. 6. The $\ell_2$ norm of the columns of E for (a) the 40 × 20 + 257 data and (b) the 80 × 20 + 257 data. The first
800 (a) and 1600 (b) columns are facial images and the last 257 columns are outliers.



VII. Conclusions 

This paper proposed a novel framework, named fixed-rank representation (FRR), for robust
unsupervised visual learning. We proved that FRR can reveal the multiple-subspace structure for clustering,
even with insufficient observations. We also demonstrated that the transposed FRR (TFRR) can
successfully recover the column space, and thus can be applied to feature extraction. There remain
several directions for future work: 1) provide a deeper analysis of LR (e.g., a general strategy for
choosing an efficient basis from $\mathcal{R}(Z)$ for subspace clustering and for determining the dimension for feature
extraction), and 2) apply FRR to supervised and semi-supervised learning.